• 1 Link Community Data
    • 1.1 Overview
    • 1.2 Environment Setup
      • 1.2.1 Input/Output
      • 1.2.2 Load Libraries
      • 1.2.3 Load data
    • 1.3 Clean & Merge Data
      • 1.3.1 Subset Data
      • 1.3.2 Identify Geographic Key
      • 1.3.3 Merge Data
    • 1.4 Visualize Data
    • 1.5 Save Data

1 Link Community Data

1.1 Overview

Geographic location can serve as a “key” that links different datasets together. By referencing each dataset and enabling its spatial location, we can integrate different types of information in one setting. In this tutorial, we will use the approximated “service areas” generated in our buffer analysis to identify vulnerable populations during the COVID pandemic. We will connect Chicago COVID-19 Case data by ZIP Code, available as a flat file on the city’s data portal, to our environment.

We will then overlap the 1-mile buffers representing walkable access to the Methadone providers in the city. We use a conservative threshold because of the multiple challenges posed by the COVID pandemic that may impact travel.

Our final goal will be to identify zip codes most impacted by COVID that are outside our acceptable access threshold. Our tutorial objectives are to:

  • Clean data in preparation of merge
  • Integrate data using geographic identifiers
  • Generate maps for a basic gap analysis

1.2 Environment Setup

To replicate the codes & functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.

1.2.1 Input/Output

Our inputs include multiple CSV and SHP files, all of which can be found here, though the providers point file was generated in the Geocoding tutorial. Note that all four files are required (.dbf, .prj, .shp, and .shx) to constitute a shapefile.

  • Chicago Methadone Clinics, methadoneClinics.shp
  • 1-mile buffer zone of Clinics, methadoneClinics_1mi.shp
  • Chicago Zip Codes, chicago_zips.shp
  • Chicago COVID case data by Zip, COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv

We will calculate the minimum distance between the resources and the centroids of the zip codes, then save the results as a shapefile and as a CSV. Our final result will be a shapefile/CSV with the minimum distance value for each zip.

1.2.2 Load Libraries

We will use the following packages in this tutorial:

  • sf: to manipulate spatial data
  • tmap: to visualize and create maps
  • units: to convert units within spatial data

Load the libraries for use.

library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tmap)

1.2.3 Load data

First we’ll load the shapefiles.

chicago_zips <- st_read("data/chicago_zips.shp")
## Reading layer `chicago_zips' from data source `/Users/maryniakolak/code/opioid-environment-toolkit/data/chicago_zips.shp' using driver `ESRI Shapefile'
## Simple feature collection with 85 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.06058 ymin: 41.58152 xmax: -87.52366 ymax: 42.06504
## CRS:            4326
meth_sf <- st_read("data/methadoneClinics.shp")
## Reading layer `methadoneClinics' from data source `/Users/maryniakolak/code/opioid-environment-toolkit/data/methadoneClinics.shp' using driver `ESRI Shapefile'
## Simple feature collection with 27 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -87.7349 ymin: 41.68698 xmax: -87.57673 ymax: 41.96475
## CRS:            4326
buffers<- st_read("data/methadoneClinics_1mi.shp")
## Reading layer `methadoneClinics_1mi' from data source `/Users/maryniakolak/code/opioid-environment-toolkit/data/methadoneClinics_1mi.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 1141979 ymin: 1824050 xmax: 1195960 ymax: 1935751
## CRS:            3435

Next, we’ll load some new data we’re interested in joining in: Chicago COVID-19 Cases, Tests, and Deaths by ZIP Code, found on the city data portal here. We’ll load in a CSV and inspect the data:

COVID <- read.csv("data/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")

1.3 Clean & Merge Data

First, let’s inspect COVID case data. What information do we need from this file? We may not need everything, so consider just identifying the field with the geographic identifier and main variable(s) of interest.

head(COVID)
##   ZIP.Code Week.Number Week.Start   Week.End Cases...Weekly Cases...Cumulative
## 1    60603          39 09/20/2020 09/26/2020              0                 13
## 2    60604          39 09/20/2020 09/26/2020              0                 31
## 3    60611          16 04/12/2020 04/18/2020              8                 72
## 4    60611          15 04/05/2020 04/11/2020              7                 64
## 5    60615          11 03/08/2020 03/14/2020             NA                 NA
## 6    60603          10 03/01/2020 03/07/2020             NA                 NA
##   Case.Rate...Weekly Case.Rate...Cumulative Tests...Weekly Tests...Cumulative
## 1                  0                 1107.3             25                327
## 2                  0                 3964.2             12                339
## 3                 25                  222.0            101                450
## 4                 22                  197.4             59                349
## 5                 NA                     NA              6                  9
## 6                 NA                     NA              0                  0
##   Test.Rate...Weekly Test.Rate...Cumulative Percent.Tested.Positive...Weekly
## 1               2130                27853.5                              0.0
## 2               1534                43350.4                              0.0
## 3                312                 1387.8                              0.1
## 4                182                 1076.3                              0.1
## 5                 14                   21.7                               NA
## 6                  0                    0.0                               NA
##   Percent.Tested.Positive...Cumulative Deaths...Weekly Deaths...Cumulative
## 1                                  0.0               0                   0
## 2                                  0.1               0                   0
## 3                                  0.2               0                   0
## 4                                  0.2               0                   0
## 5                                   NA               0                   0
## 6                                   NA               0                   0
##   Death.Rate...Weekly Death.Rate...Cumulative Population   Row.ID
## 1                   0                       0       1174 60603-39
## 2                   0                       0        782 60604-39
## 3                   0                       0      32426 60611-16
## 4                   0                       0      32426 60611-15
## 5                   0                       0      41563 60615-11
## 6                   0                       0       1174 60603-10
##              ZIP.Code.Location
## 1 POINT (-87.625473 41.880112)
## 2 POINT (-87.629029 41.878153)
## 3 POINT (-87.620291 41.894734)
## 4 POINT (-87.620291 41.894734)
## 5 POINT (-87.602725 41.801993)
## 6 POINT (-87.625473 41.880112)

From this we can assess the need for the following variables: ZIP Code and Percent Tested Positive - Cumulative. Let’s subset the data accordingly. Because this data file has extremely long header names (common in the epi world), let’s use the colnames function to get exactly what we need.

colnames(COVID)
##  [1] "ZIP.Code"                            
##  [2] "Week.Number"                         
##  [3] "Week.Start"                          
##  [4] "Week.End"                            
##  [5] "Cases...Weekly"                      
##  [6] "Cases...Cumulative"                  
##  [7] "Case.Rate...Weekly"                  
##  [8] "Case.Rate...Cumulative"              
##  [9] "Tests...Weekly"                      
## [10] "Tests...Cumulative"                  
## [11] "Test.Rate...Weekly"                  
## [12] "Test.Rate...Cumulative"              
## [13] "Percent.Tested.Positive...Weekly"    
## [14] "Percent.Tested.Positive...Cumulative"
## [15] "Deaths...Weekly"                     
## [16] "Deaths...Cumulative"                 
## [17] "Death.Rate...Weekly"                 
## [18] "Death.Rate...Cumulative"             
## [19] "Population"                          
## [20] "Row.ID"                              
## [21] "ZIP.Code.Location"

1.3.1 Subset Data

We can now subset to just include the fields we need. There are many different ways to subset in R – we just use one example here! Inspect your data to confirm it was pulled correctly.

COVID.sub <- COVID[, c("ZIP.Code", "Case.Rate...Cumulative")]
head(COVID.sub)
##   ZIP.Code Case.Rate...Cumulative
## 1    60603                 1107.3
## 2    60604                 3964.2
## 3    60611                  222.0
## 4    60611                  197.4
## 5    60615                     NA
## 6    60603                     NA

1.3.2 Identify Geographic Key

Before merging, we need to first identify the geographic identifier we would like to merge on. It is the field “ZIP.Code” in our subset. What about the zip code file, which we will be merging to?

head(chicago_zips)
## Simple feature collection with 6 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.06058 ymin: 41.73452 xmax: -87.58209 ymax: 42.04052
## CRS:            4326
##   ZCTA5CE10 GEOID10 CLASSFP10 MTFCC10 FUNCSTAT10  ALAND10 AWATER10  INTPTLAT10
## 1     60501   60501        B5   G6350          S 12532295   974360 +41.7802209
## 2     60007   60007        B5   G6350          S 36493383   917560 +42.0086000
## 3     60651   60651        B5   G6350          S  9052862        0 +41.9020934
## 4     60652   60652        B5   G6350          S 12987857        0 +41.7479319
## 5     60653   60653        B5   G6350          S  6041418  1696670 +41.8199645
## 6     60654   60654        B5   G6350          S  1464813   113471 +41.8918225
##     INTPTLON10                       geometry
## 1 -087.8232440 MULTIPOLYGON (((-87.86289 4...
## 2 -087.9973398 MULTIPOLYGON (((-88.06058 4...
## 3 -087.7408565 MULTIPOLYGON (((-87.77559 4...
## 4 -087.7147951 MULTIPOLYGON (((-87.74205 4...
## 5 -087.6059654 MULTIPOLYGON (((-87.62623 4...
## 6 -087.6383036 MULTIPOLYGON (((-87.64775 4...

Aha – in this dataset, the zip is identified as either the ZCTA5CE10 field or GEOID10 field. Not that we are actually working with 5-digit ZCTA fields, not 9-digit ZIP codes… We decide to merge on the GEOID10 field. To make our lives easier, we’ll generate a duplicate field in our subset with a new name, GEOID10, to match. We also convert from the factor structure to a character field to match the data structure of the master zip file.

COVID.sub$GEOID10<- as.character(COVID.sub$ZIP.Code)

1.3.3 Merge Data

Let’s merge the data using the zip code geographic identifier, “ZIP Code” field, to bring in the the Percent Tested Positive - Cumalative dataset. Inspect the data to confirm it merged correctly!

zips_merged <- merge(chicago_zips, COVID.sub, by = "GEOID10")
head(zips_merged)
## Simple feature collection with 6 features and 11 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.63396 ymin: 41.88083 xmax: -87.6129 ymax: 41.88893
## CRS:            4326
##   GEOID10 ZCTA5CE10 CLASSFP10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10  INTPTLAT10
## 1   60601     60601        B5   G6350          S  934226    60682 +41.8856419
## 2   60601     60601        B5   G6350          S  934226    60682 +41.8856419
## 3   60601     60601        B5   G6350          S  934226    60682 +41.8856419
## 4   60601     60601        B5   G6350          S  934226    60682 +41.8856419
## 5   60601     60601        B5   G6350          S  934226    60682 +41.8856419
## 6   60601     60601        B5   G6350          S  934226    60682 +41.8856419
##     INTPTLON10 ZIP.Code Case.Rate...Cumulative                       geometry
## 1 -087.6215226    60601                 1451.4 MULTIPOLYGON (((-87.63396 4...
## 2 -087.6215226    60601                  872.2 MULTIPOLYGON (((-87.63396 4...
## 3 -087.6215226    60601                  919.9 MULTIPOLYGON (((-87.63396 4...
## 4 -087.6215226    60601                  558.8 MULTIPOLYGON (((-87.63396 4...
## 5 -087.6215226    60601                  156.7 MULTIPOLYGON (((-87.63396 4...
## 6 -087.6215226    60601                  477.0 MULTIPOLYGON (((-87.63396 4...

```

1.4 Visualize Data

Now we are ready to visualize our data! First we’ll make a simple map. We generate a choropleth map of case rate data using quantile bins, and the Blue-Purple color palette. You can find an R color cheatsheet useful for identifying palette codes here. (More details on thematic mapping are in tutorials that follow!) We overlay the buffers and clincal providers.

tm_shape(zips_merged) +
  tm_polygons("Case.Rate...Cumulative", style="quantile", pal="BuPu",
              title = "COVID Case Rate") +
  tm_shape(buffers) + tm_borders(col = "blue") +
  tm_shape(meth_sf) + tm_dots(col = "black", size = 0.2) 

Already we can generate some insight. Areas on the far West side of the city have some of the highest case rates, but are outside a walkable distance to Methadone providers. For individuals with opioid use disorder requiring medication access in these locales, they may be especially vulnerable during the pandemic.

Next, we adjust some tmap parameters to improve our map. Now we switch to a red-yellow-green palette, and specify six bins for our quantile map. We flip the direction of the palette using a negative sign, so that red corresponds to areas with higher rates. We adjust transparency using an alpha parameter, and line thickness using the lwd parameter.

tm_shape(zips_merged) +
  tm_fill("Case.Rate...Cumulative", style="quantile", n=6, pal="-RdYlGn",
              title = "COVID Case Rate",alpha = 0.8) + 
  tm_borders(lwd = 0) + 
  tm_shape(buffers) + tm_borders(col = "gray") + tm_fill(alpha=0.5) +
  tm_shape(meth_sf) + tm_dots(col = "black", size = 0.1) 

To improve our map even further, let’s make in interactive! By switching the tmap_mode function to “view” (from “plot” the default), our newly rendered map is not interactive. We can zoom in and out, click on different basemaps or turn layers on our off, and click on resources for more information.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(zips_merged) +
  tm_fill("Case.Rate...Cumulative", style="quantile", n=6, pal="-RdYlGn",
              title = "COVID Case Rate",alpha = 0.8) + 
  tm_borders(lwd = 0) + 
  tm_shape(buffers) + tm_borders(col = "gray") + tm_fill(alpha=0.5) +
  tm_shape(meth_sf) + tm_dots(col = "black", size = 0.1) 
## Warning in `$.crs`(gm$shape.master_crs, "proj4string"): CRS uses proj4string,
## which is deprecated.

## Warning in `$.crs`(gm$shape.master_crs, "proj4string"): CRS uses proj4string,
## which is deprecated.
## Warning in `$.crs`(crs, "proj4string"): CRS uses proj4string, which is
## deprecated.
## Warning in `$.crs`(gm$shape.master_crs, "proj4string"): CRS uses proj4string,
## which is deprecated.

Using this approach, it’s clear that the zip codes on the West Side most vulnerable are 60651, 60644, 60632, and 60629. By updating the thresholds and parameters further, these can shift as well to be more or less conservative based on our assumptions.

1.5 Save Data

We save our newly merged ZCTA level data for future analysis. The st_write function does the trick. Uncomment, and run on your system!

write_sf(zips_merged, "data/chizips_COVID.shp")
## Warning in abbreviate_shapefile_names(obj): Field names abbreviated for ESRI
## Shapefile driver